| Name | NetID | |
|---|---|---|
| Artsiom Strok | astrok2 | astrok2@illinois.edu |
| Lin Jiang | linj3 | linj3@illinois.edu |
| Mayank Chhablani | mchhab2 | mchhab2@illinois.edu |
Life expectancy has increased dramatically over the last few centuries. Since 1900 the global average life expectancy has more than doubled and there has been a huge development in health sector in the past 15 years resulting in improvement of human mortality rates especially in the developing nations in comparison to the past 30 years. Therefore, in this project, only data from year 2000-2015 is considered for exploration and analysis.
This dataset is a compilation of data from the Global Health Observatory (GHO) and United Nations. The GHO data repository is WHO’s gateway to health-related statistics which provides access to a variety of indicators on priority health topics including mortality and burden of diseases, environmental health, violence and injuries etc. (http://apps.who.int/gho/data/node.resources). The economic data such as GDP is collected from the National Accounts Main Aggregates Database under United Nations which collects and disseminates economic statistics from countries worldwide (https://unstats.un.org/unsd/snaama/Index).
This dataset is cleaned by removing some missing values, maily for population, Hepatitis B and GDP from less known countries and shared on Kaggle website (https://www.kaggle.com/kumarajarshi/life-expectancy-who). The final dataset contains 2938 observations with 22 variables which are more critical and representative among all the categories of health-related factors from year 2000 - 2015 for 193 countries.
The description of each variable for this dataset is listed below:
Numerical Response
Life expectancy: Life Expectancy in ageNumerical Predictors
Year: YearAdult Mortality: Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)infant: deathsNumber of Infant Deaths per 1000 populationAlcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)percentage expenditure: Expenditure on health as a percentage of Gross Domestic Product per capita(%)Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%)Measles: Measles - number of reported cases per 1000 populationBMI: Average Body Mass Index of entire populationunder-five deaths: Number of under-five deaths per 1000 populationPolio: Polio (Pol3) immunization coverage among 1-year-olds (%)Total expenditure: General government expenditure on health as a percentage of total government expenditure (%)Diphtheria: Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)HIV/AIDS: Deaths per 1 000 live births HIV/AIDS (0-4 years)GDP: Gross Domestic Product per capita (in USD)Population: Population of the countrythinness 1-19 years: Prevalence of thinness among children and adolescents for Age 10 to 19 (% )thinness 5-9 years: Prevalence of thinness among children for Age 5 to 9(%)Income composition of resources: Human Development Index in terms of income composition of resources (index ranging from 0 to 1)Schooling: Number of years of Schooling(years)Categotical Predictors
Country: CountryStatus: Developed or Developing statusThe GHO data repository is WHO’s gateway to health-related statistics which provides access to a variety of indicators on priority health topics including mortality and burden of diseases, environmental health, violence and injuries etc. (http://apps.who.int/gho/data/node.resources).
The economic data such as GDP is collected from the National Accounts Main Aggregates Database under United Nations which collects and disseminates economic statistics from countries worldwide (https://unstats.un.org/unsd/snaama/Index).
By combining the data from these two databases and exploring the interactions between the variables such as immunization, moratlity, education and economic factors, we expect to build a robust multiple linear regression model to predict the life expectancy more accurately.
For this project, we would like to:
library(knitr)
data = read.csv("Life Expectancy Data.csv")
kable(t(data[sample(nrow(data), 5), ]))
| 2508 | 912 | 2358 | 976 | 2041 | |
|---|---|---|---|---|---|
| Country | Sweden | Fiji | Slovenia | Gambia | Poland |
| Year | 2013 | 2002 | 2003 | 2002 | 2014 |
| Status | Developed | Developing | Developed | Developing | Developed |
| Life.expectancy | 81.9 | 67.9 | 76.5 | 56.6 | 77.3 |
| Adult.Mortality | 57 | 22 | 119 | 298 | 12 |
| infant.deaths | 0 | 0 | 0 | 3 | 2 |
| Alcohol | 7.30 | 1.85 | 11.58 | 2.08 | 10.71 |
| percentage.expenditure | 1212.7 | 206.1 | 203.3 | 0.0 | 243.8 |
| Hepatitis.B | 67 | 99 | NA | 92 | 96 |
| Measles | 51 | 304 | 0 | 32 | 0 |
| BMI | 58.5 | 52.2 | 52.9 | 19.1 | 61.1 |
| under.five.deaths | 0 | 0 | 0 | 6 | 2 |
| Polio | 98 | 94 | 95 | 86 | 94 |
| Total.expenditure | 11.97 | 3.48 | 8.77 | 3.75 | 6.35 |
| Diphtheria | 98 | 93 | 95 | 87 | 98 |
| HIV.AIDS | 0.1 | 0.1 | 0.1 | 2.5 | 0.1 |
| GDP | 6283 | 2260 | 1488 | NA | 14342 |
| Population | 96379 | 815691 | 1995733 | NA | 3811735 |
| thinness..1.19.years | 1.4 | 4.2 | 2.0 | 9.8 | 1.9 |
| thinness.5.9.years | 1.3 | 3.9 | 2.1 | 9.8 | 2.1 |
| Income.composition.of.resources | 0.904 | 0.687 | 0.843 | 0.392 | 0.850 |
| Schooling | 15.8 | 13.3 | 16.1 | 7.0 | 16.4 |
original_data = read.csv("Life Expectancy Data.csv")
kable(t(original_data[sample(nrow(original_data), 5), ]))
| 1380 | 2304 | 1125 | 1029 | 2525 | |
|---|---|---|---|---|---|
| Country | Kiribati | Sierra Leone | Haiti | Greece | Switzerland |
| Year | 2014 | 2009 | 2013 | 2013 | 2012 |
| Status | Developing | Developing | Developing | Developing | Developed |
| Life.expectancy | 66.1 | 47.1 | 62.7 | 86.0 | 82.7 |
| Adult.Mortality | 2 | 433 | 253 | 74 | 54 |
| infant.deaths | 0 | 28 | 14 | 0 | 0 |
| Alcohol | 0.01 | 3.97 | 5.68 | 7.46 | 9.86 |
| percentage.expenditure | 97.87 | 49.84 | 4.99 | 2183.11 | 18379.33 |
| Hepatitis.B | 75 | 84 | 68 | 98 | NA |
| Measles | 0 | 31 | 0 | 3 | 61 |
| BMI | 77.1 | 21.2 | 47.7 | 65.4 | 56.2 |
| under.five.deaths | 0 | 42 | 19 | 0 | 0 |
| Polio | 79 | 81 | 67 | 99 | 96 |
| Total.expenditure | 1.21 | 13.13 | 8.10 | 9.26 | 11.59 |
| Diphtheria | 75 | 84 | 68 | 99 | 96 |
| HIV.AIDS | 0.1 | 1.7 | 0.5 | 0.1 | 0.1 |
| GDP | 1684.54 | 394.59 | 81.27 | 21874.82 | 83164.39 |
| Population | 11458 | 63126 | 1431776 | 1965211 | 7996861 |
| thinness..1.19.years | 0.1 | 8.5 | 3.9 | 0.8 | 0.5 |
| thinness.5.9.years | 0.1 | 8.4 | 3.9 | 0.7 | 0.3 |
| Income.composition.of.resources | 0.597 | 0.375 | 0.483 | 0.860 | 0.932 |
| Schooling | 11.9 | 8.5 | 9.1 | 17.1 | 15.7 |
kable(sort(colSums(is.na(original_data)), decreasing = TRUE), col.names = "Number of missing values")
| Number of missing values | |
|---|---|
| Population | 652 |
| Hepatitis.B | 553 |
| GDP | 448 |
| Total.expenditure | 226 |
| Alcohol | 194 |
| Income.composition.of.resources | 167 |
| Schooling | 163 |
| BMI | 34 |
| thinness..1.19.years | 34 |
| thinness.5.9.years | 34 |
| Polio | 19 |
| Diphtheria | 19 |
| Life.expectancy | 10 |
| Adult.Mortality | 10 |
| Country | 0 |
| Year | 0 |
| Status | 0 |
| infant.deaths | 0 |
| percentage.expenditure | 0 |
| Measles | 0 |
| under.five.deaths | 0 |
| HIV.AIDS | 0 |
1289 samples have at least one missing value. Alcohol is missing in 194 samples all of them belongs to 2015 year and these countiries definately should have alcohol consuption more than 0. The data was collected in 2015 when data about alcohol consumption simply was not available.
Life expextancy and adult mortality is missing for 10 samples in 2013, all of them belongs to islands.
Hepatitis B is missing in 553 samples. Samples belongs to different countries and years.
nrow(original_data)
## [1] 2938
data = na.omit(original_data)
nrow(original_data) - nrow(data)
## [1] 1289
data = data[-1] #exclude country from dataset
kable(t(do.call(cbind, lapply(data, summary))))
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | |
|---|---|---|---|---|---|---|
| Year | 2000.000 | 2005.000 | 2.008e+03 | 2.008e+03 | 2.011e+03 | 2.015e+03 |
| Status | 242.000 | 1407.000 | 2.420e+02 | 1.407e+03 | 2.420e+02 | 1.407e+03 |
| Life.expectancy | 44.000 | 64.400 | 7.170e+01 | 6.930e+01 | 7.500e+01 | 8.900e+01 |
| Adult.Mortality | 1.000 | 77.000 | 1.480e+02 | 1.682e+02 | 2.270e+02 | 7.230e+02 |
| infant.deaths | 0.000 | 1.000 | 3.000e+00 | 3.255e+01 | 2.200e+01 | 1.600e+03 |
| Alcohol | 0.010 | 0.810 | 3.790e+00 | 4.533e+00 | 7.340e+00 | 1.787e+01 |
| percentage.expenditure | 0.000 | 37.439 | 1.451e+02 | 6.990e+02 | 5.094e+02 | 1.896e+04 |
| Hepatitis.B | 2.000 | 74.000 | 8.900e+01 | 7.922e+01 | 9.600e+01 | 9.900e+01 |
| Measles | 0.000 | 0.000 | 1.500e+01 | 2.224e+03 | 3.730e+02 | 1.314e+05 |
| BMI | 2.000 | 19.500 | 4.370e+01 | 3.813e+01 | 5.580e+01 | 7.710e+01 |
| under.five.deaths | 0.000 | 1.000 | 4.000e+00 | 4.422e+01 | 2.900e+01 | 2.100e+03 |
| Polio | 3.000 | 81.000 | 9.300e+01 | 8.356e+01 | 9.700e+01 | 9.900e+01 |
| Total.expenditure | 0.740 | 4.410 | 5.840e+00 | 5.956e+00 | 7.470e+00 | 1.439e+01 |
| Diphtheria | 2.000 | 82.000 | 9.200e+01 | 8.416e+01 | 9.700e+01 | 9.900e+01 |
| HIV.AIDS | 0.100 | 0.100 | 1.000e-01 | 1.984e+00 | 7.000e-01 | 5.060e+01 |
| GDP | 1.681 | 462.150 | 1.593e+03 | 5.566e+03 | 4.719e+03 | 1.192e+05 |
| Population | 34.000 | 191897.000 | 1.420e+06 | 1.465e+07 | 7.659e+06 | 1.294e+09 |
| thinness..1.19.years | 0.100 | 1.600 | 3.000e+00 | 4.851e+00 | 7.100e+00 | 2.720e+01 |
| thinness.5.9.years | 0.100 | 1.700 | 3.200e+00 | 4.908e+00 | 7.100e+00 | 2.820e+01 |
| Income.composition.of.resources | 0.000 | 0.509 | 6.730e-01 | 6.316e-01 | 7.510e-01 | 9.360e-01 |
| Schooling | 4.200 | 10.300 | 1.230e+01 | 1.212e+01 | 1.400e+01 | 2.070e+01 |
Looking at the summary data we can already see some inconsistencies. In Infant Deaths we see that the max value listed is 1600 which doesn’t make sense since we’re working with per 1000 population data. The same or similar numbers we can see for Infant deaths, Measles, Under five deaths
boxplot(data$Adult.Mortality)
boxplot(data$infant.deaths)
boxplot(data$Measles)
boxplot(data$under.five.deaths)
This original dataset contains Country variable which we would not use for model building. Therefore, this column is excluded from the following data analysis as well as the records with NA in some of the columns mentioned above. This will reduce the size of the original dataset from 2938 to 1649 rows, which captures majority of the information and allows the speed of modeling to be more efficient.
Let’s take a look on histograms for each attribute in the dataset.
ggplot(gather(data[-2]), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x')
Let’s take a look of the dataset using plots of Life Expectancy vs Status or Year. The boxplot indicates that there is significant difference in Life Expectancy between the Developed and Developing countries. As expected, the Life Expectancy increases as the Year passes by and the Violin plot shows that the data is well distributed across different years.
par(mfrow = c(1, 2))
# Histogram of Life Expectancy
hist(
data$Life.expectancy,
xlab = "Life Expectancy",
main = "Distribution of Life Expectancy",
col = "dodgerblue",
breaks = 25
)
# Boxplot of Life Expectancy vs Country Status (Developed vs Developing)
plot(
data$Status,
data$Life.expectancy,
xlab = "Status",
ylab = "Life Expectancy",
main = "Life Expectancy vs. Status",
col = c(2, 3)
)
# Violin plot of Life Expectancy vs. Year
data %>% ggplot() + geom_violin(aes(
x = Year,
y = Life.expectancy,
group = Year,
fill = Year
)) + labs(title = "Life Expectancy vs. Year")
Colinearity issue is visualized using the following plots. Resulsts show that some predictors have strong collinearity issues, such as infant.deaths vs. under.five.deaths, GDP vs. percentage.expenditure, Population vs. thinness..1.19.years etc.
corrplot(
cor(data[-c(2)]),
method = "color",
order = "hclust",
type = "lower",
outline = TRUE,
tl.col = "black",
addCoef.col = "darkgreen",
)
ggpairs(data[-c(2)])
# train test split 70/30 hold out
train_size = floor(0.7 * nrow(data))
train_idx = sample(nrow(data), train_size)
data.train = data[train_idx, ]
data.test = data[-train_idx, ]
Steps to generate best model:
We will start with simple additive model and then use AIC/ BIC to weed out unnecessary predictors. We then, may try out interactions and see if that improves our model and later, we may use response or, predictor transformations to improve our selected model.
Additve Model with all predictors
model.additive = lm(Life.expectancy ~ ., data.train)
#summary(model.additive)
par(mfrow = c(2, 2))
plot(model.additive)
So, looking at Residual vs Fitted plot that we may have issue of linearity and equal-variance. Similary, we need to check for normality assumption as dictated by Normal Q-Q plot
Let’s check for Homoscedasticityand Normality assumptions.
show_metrics(model.additive)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 3.19638743734941e-21 | 0.000059802581149761 | 0.844706587093747 | 3.50866876426059 | 11.8303259589542 | 14.4884239442948 |
So, from above data it looks liek although Adjusted R^2 and RMSE is better, but, we REJECT both Homoscedasticityand Normality assumptions. Also, from summary of model it seems there is need to prune out some predictors.
Let’s apply AIC and BIC to see if we can improve our model
model.additive.step.aic = step(model.additive, trace = FALSE)
par(mfrow = c(2, 2))
plot(model.additive.step.aic)
So, looking at AIC model, we can see that GDPvs percentage.expenditure collinearity issue has been removed by AIC. Also, some of the non-significant predictors are also removed.
show_metrics(model.additive.step.aic)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 2.61540494570909e-22 | 0.0000318962005371406 | 0.84483096513633 | 3.50080398499081 | 11.8834501668962 | 14.4433330629638 |
Apart from reduction in predictors, there isn’t a significant improvement. Let’s try out BIC:
model.additive.step.bic = step(model.additive, k = log(nrow(data.train)), trace = FALSE)
par(mfrow = c(2, 2))
plot(model.additive.step.bic)
show_metrics(model.additive.step.bic)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 2.12322671490772e-20 | 0.0000133651092912547 | 0.842918941917074 | 3.51818636644947 | 12.0721275312972 | 14.5125336442044 |
So, BIC and AIC resutls are almost same and so does the evaluatin metric but, we saw two improvements: (1) Reduction in number of predictors (2) Reduction in influential points as per the Residual vs Leverage graph.
Hence, let’s select AIC model and try interactions. The BIC is little aggressive in removing predictors so, we are selecting AIC additive model.
Also, let’s check for variance-inflation-factor
vif_aic_all_model = vif(model.additive.step.aic)
vif_aic_all_model[vif_aic_all_model > 5]
## infant.deaths under.five.deaths
## 225.2 220.1
As evident from above VIF computation, infant.deaths and under.five.deaths in the model are causing collinearity issue. This same issue we found during our correlation plot. Let’s try to improve the model and remove issues which are violating linearity assumptions.
Applying Two-Way Interaction:
model.interaction = lm(Life.expectancy ~ . ^ 2, data = data.train)
Since, this will explode the number of predictors let’s generate AIC and BIC models for this interaction model.
AIC Interaction Model:
model.interaction.step.aic = step(model.interaction, trace = 0)
par(mfrow = c(2, 2))
plot(model.interaction.step.aic)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
show_metrics(model.interaction.step.aic)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.00598315752593769 | 5.02438486442735e-10 | 0.918642514611818 | 2.89436801651001 | 5.59611868899128 | 15.3358935315704 |
AIC and BIC model are again somewhat similar. Total numbr of predictors used by AIC is 119 and we can see that with interactions we have made improvement in adjusted r_squared and bptest. Let’s see the VIF imapct
vif_interaction_model = vif(model.interaction.step.aic)
vif_interaction_model[vif_interaction_model > 5]
## Year
## 9.463e+01
## StatusDeveloping
## 1.206e+03
## Adult.Mortality
## 7.079e+05
## infant.deaths
## 3.197e+08
## Alcohol
## 1.245e+02
## percentage.expenditure
## 8.236e+06
## Hepatitis.B
## 7.088e+01
## Measles
## 8.146e+02
## BMI
## 1.094e+02
## under.five.deaths
## 3.528e+08
## Polio
## 6.172e+01
## Total.expenditure
## 4.558e+05
## Diphtheria
## 1.709e+02
## HIV.AIDS
## 1.108e+06
## GDP
## 8.047e+06
## Population
## 3.971e+06
## thinness..1.19.years
## 9.647e+06
## thinness.5.9.years
## 9.988e+06
## Income.composition.of.resources
## 1.820e+06
## Schooling
## 1.380e+06
## Year:Adult.Mortality
## 7.089e+05
## Year:infant.deaths
## 3.238e+08
## Year:percentage.expenditure
## 8.243e+06
## Year:under.five.deaths
## 3.577e+08
## Year:Total.expenditure
## 4.577e+05
## Year:HIV.AIDS
## 1.107e+06
## Year:GDP
## 8.075e+06
## Year:Population
## 4.022e+06
## Year:thinness..1.19.years
## 9.691e+06
## Year:thinness.5.9.years
## 1.003e+07
## Year:Income.composition.of.resources
## 1.825e+06
## Year:Schooling
## 1.389e+06
## StatusDeveloping:Adult.Mortality
## 8.924e+01
## StatusDeveloping:infant.deaths
## 1.800e+06
## StatusDeveloping:Hepatitis.B
## 6.603e+01
## StatusDeveloping:under.five.deaths
## 2.288e+06
## StatusDeveloping:Total.expenditure
## 2.777e+01
## StatusDeveloping:Diphtheria
## 4.078e+02
## StatusDeveloping:thinness..1.19.years
## 3.142e+04
## StatusDeveloping:thinness.5.9.years
## 2.619e+04
## StatusDeveloping:Income.composition.of.resources
## 1.092e+03
## StatusDeveloping:Schooling
## 2.695e+02
## Adult.Mortality:percentage.expenditure
## 4.291e+01
## Adult.Mortality:BMI
## 1.228e+01
## Adult.Mortality:under.five.deaths
## 4.157e+01
## Adult.Mortality:HIV.AIDS
## 1.225e+01
## Adult.Mortality:GDP
## 4.236e+01
## Adult.Mortality:Population
## 2.845e+01
## Adult.Mortality:Income.composition.of.resources
## 5.368e+01
## Adult.Mortality:Schooling
## 9.330e+01
## infant.deaths:Alcohol
## 3.618e+01
## infant.deaths:percentage.expenditure
## 8.951e+02
## infant.deaths:Hepatitis.B
## 5.134e+03
## infant.deaths:BMI
## 1.648e+03
## infant.deaths:under.five.deaths
## 4.808e+02
## infant.deaths:Polio
## 2.376e+02
## infant.deaths:Diphtheria
## 2.092e+04
## infant.deaths:HIV.AIDS
## 1.872e+01
## infant.deaths:GDP
## 1.577e+03
## infant.deaths:Population
## 7.406e+04
## infant.deaths:thinness..1.19.years
## 1.385e+04
## infant.deaths:Income.composition.of.resources
## 4.829e+03
## Alcohol:Hepatitis.B
## 4.749e+01
## Alcohol:Total.expenditure
## 3.343e+01
## Alcohol:Diphtheria
## 1.126e+02
## Alcohol:HIV.AIDS
## 1.597e+01
## Alcohol:Population
## 2.349e+01
## Alcohol:thinness..1.19.years
## 1.078e+01
## percentage.expenditure:Hepatitis.B
## 2.800e+01
## percentage.expenditure:under.five.deaths
## 1.175e+03
## percentage.expenditure:Diphtheria
## 4.224e+02
## percentage.expenditure:GDP
## 2.275e+01
## percentage.expenditure:Income.composition.of.resources
## 2.767e+03
## percentage.expenditure:Schooling
## 2.626e+03
## Hepatitis.B:Measles
## 6.286e+01
## Hepatitis.B:under.five.deaths
## 5.359e+03
## Hepatitis.B:Total.expenditure
## 3.663e+01
## Hepatitis.B:Diphtheria
## 2.463e+01
## Hepatitis.B:Population
## 8.144e+01
## Hepatitis.B:thinness..1.19.years
## 5.593e+02
## Hepatitis.B:thinness.5.9.years
## 5.507e+02
## Measles:BMI
## 1.992e+01
## Measles:Polio
## 2.424e+02
## Measles:Total.expenditure
## 3.579e+01
## Measles:Diphtheria
## 3.718e+02
## Measles:HIV.AIDS
## 9.406e+00
## Measles:GDP
## 5.379e+00
## Measles:Population
## 1.781e+02
## Measles:thinness.5.9.years
## 2.053e+02
## Measles:Income.composition.of.resources
## 3.891e+02
## BMI:under.five.deaths
## 1.656e+03
## BMI:GDP
## 1.799e+01
## BMI:Population
## 4.187e+01
## BMI:thinness..1.19.years
## 1.681e+02
## BMI:thinness.5.9.years
## 1.615e+02
## BMI:Income.composition.of.resources
## 1.320e+02
## BMI:Schooling
## 1.883e+02
## under.five.deaths:Diphtheria
## 1.845e+04
## under.five.deaths:GDP
## 1.908e+03
## under.five.deaths:Population
## 6.835e+04
## under.five.deaths:thinness..1.19.years
## 1.394e+04
## under.five.deaths:thinness.5.9.years
## 9.257e+02
## under.five.deaths:Schooling
## 1.231e+03
## Polio:Total.expenditure
## 5.683e+01
## Polio:GDP
## 1.340e+02
## Polio:thinness..1.19.years
## 2.202e+02
## Polio:thinness.5.9.years
## 2.050e+02
## Polio:Income.composition.of.resources
## 1.081e+02
## Total.expenditure:Diphtheria
## 7.080e+01
## Total.expenditure:HIV.AIDS
## 3.156e+01
## Total.expenditure:GDP
## 1.741e+01
## Total.expenditure:Population
## 8.373e+01
## Total.expenditure:thinness..1.19.years
## 1.884e+01
## Total.expenditure:Income.composition.of.resources
## 6.184e+01
## Diphtheria:Income.composition.of.resources
## 1.361e+02
## Diphtheria:Schooling
## 1.910e+02
## HIV.AIDS:Population
## 5.303e+00
## HIV.AIDS:thinness..1.19.years
## 1.491e+02
## HIV.AIDS:thinness.5.9.years
## 1.469e+02
## GDP:thinness.5.9.years
## 6.555e+00
## GDP:Schooling
## 2.859e+03
## Population:thinness.5.9.years
## 4.160e+02
## Population:Schooling
## 7.387e+02
## thinness..1.19.years:thinness.5.9.years
## 6.667e+01
## thinness..1.19.years:Income.composition.of.resources
## 4.919e+03
## thinness..1.19.years:Schooling
## 3.574e+03
## thinness.5.9.years:Income.composition.of.resources
## 4.938e+03
## thinness.5.9.years:Schooling
## 3.794e+03
This interaction model is getting overly complicated. Let’s inspect pairs scatter plot.
pairs(data.frame(data$Life.expectancy, data$Year, data$Status, data$Adult.Mortality, data$infant.deaths, data$Alcohol, data$percentage.expenditure, data$BMI, data$under.five.deaths, data$Polio, data$Total.expenditure, data$HIV.AIDS, data$thinness.5.9.years, data$Income.composition.of.resources, data$Schooling))
Based on this cross-predictor plots, we confirm that (a) HIV.AIDS has non-linear relationship with Life.expectancy' and, (B)HIV.AIDSandAdult.Mortalityare also interacting non-linearly. Also, we may condsider removingYear` because it may divert model in different direction because of its chaotic relationship with the response.
Check for impact of high VIF predictors.
model_selected_wo_underfive = lm(
Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling,
data = data.train
)
model_selected_wo_infant.deaths = lm(
Life.expectancy ~ Year + Status + Adult.Mortality + under.five.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling,
data = data.train
)
model_selected_underfive = lm(
under.five.deaths ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling,
data = data.train
)
model_selected_infant.deaths = lm(
infant.deaths ~ Year + Status + Adult.Mortality + under.five.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling,
data = data.train
)
cor(resid(model_selected_underfive),
resid(model_selected_wo_underfive))
## [1] -0.2527
cor(resid(model_selected_infant.deaths),
resid(model_selected_wo_infant.deaths))
## [1] 0.2487
Based on this we can get rid of under.five.deaths
model_selected = lm(
Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling ,
data = data.train
)
par(mfrow = c(2, 2))
plot(model_selected)
show_metrics(model_selected)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 5.59572259383389e-29 | 0.0000817238216404277 | 0.833747060667492 | 3.62603598753409 | 12.7434773099163 | 14.5659873637219 |
vif(model_selected)
## Year StatusDeveloping
## 1.121 1.802
## Adult.Mortality infant.deaths
## 1.837 1.392
## Alcohol percentage.expenditure
## 2.246 1.418
## BMI Polio
## 1.784 1.199
## Total.expenditure HIV.AIDS
## 1.124 1.576
## thinness.5.9.years Income.composition.of.resources
## 1.996 3.263
## Schooling
## 3.679
Looks like none of our predictor is causing inflation in variance. Let’s consider interaction of these predictors.
model_selected_interactions = lm(
Life.expectancy ~ (
Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling
) ^ 2 ,
data = data.train
)
par(mfrow = c(2, 2))
plot(model_selected_interactions)
show_metrics(model_selected_interactions)
## Warning in predict.lm(model, data.train): prediction from a rank-deficient fit
## may be misleading
## Warning in predict.lm(model, data.test): prediction from a rank-deficient fit
## may be misleading
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 4.84360148121369e-06 | 3.58602549267733e-09 | 0.89981396499582 | 2.96815448606524 | 7.16067923906753 | 10.356646923468 |
model_selected_interactions_aic = step(model_selected_interactions, trace = FALSE)
#summary(model_selected_interactions_aic)
show_metrics(model_selected_interactions_aic)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.44805155269171e-06 | 1.04271639578767e-09 | 0.901025514701528 | 2.87016492853926 | 7.27373007030712 | 10.5189876384089 |
model_selected_interactions_bic=step(model_selected_interactions, trace=0, k = log(nrow(data.train)))
#summary(model_selected_interactions_bic)
show_metrics(model_selected_interactions_bic)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.09684726187374e-08 | 2.12464993566629e-10 | 0.896246382310163 | 2.9003927388848 | 7.77145266043916 | 10.7769449544755 |
par(mfrow=c(2,2))
plot(model_selected_interactions_bic)
model_selected_poly_interactions=lm(Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling + infant.deaths:Polio + Adult.Mortality:HIV.AIDS + I(HIV.AIDS^2) , data=data.train)
par(mfrow = c(2, 2))
plot(model_selected_poly_interactions)
show_metrics(model_selected_poly_interactions)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 9.90771175086473e-12 | 7.59134585416287e-06 | 0.85857538224716 | 3.34510211995818 | 10.811830900963 | 12.8744470461626 |
model_selected_poly_interactions=lm(Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS +I(HIV.AIDS^2) , data=data.train)
par(mfrow = c(2, 2))
plot(model_selected_poly_interactions)
show_metrics(model_selected_poly_interactions)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 3.22341712873213e-11 | 7.76432430757746e-06 | 0.857347354152492 | 3.35556741522419 | 10.915304600423 | 12.7046394784675 |
vif_poly = vif(model_selected_poly_interactions)
vif_poly[vif_poly > 5]
## HIV.AIDS I(HIV.AIDS^2) Adult.Mortality:HIV.AIDS
## 13.817 9.387 6.205
length(coef(model_selected_poly_interactions))
## [1] 16
model_selected_poly_interactions=lm(Life.expectancy ~ Year + Status + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure + BMI + Polio + Total.expenditure + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS +I(HIV.AIDS^2) , data=data.train)
model_selected_poly_interactions_aic=step(model_selected_poly_interactions, trace=0, k=log(nrow(data)))
par(mfrow = c(2, 2))
plot(model_selected_poly_interactions_aic)
show_metrics(model_selected_poly_interactions_aic)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.30367180064734e-08 | 2.64023093149332e-06 | 0.855313070446564 | 3.37128342500836 | 11.1196033881997 | 13.0238764269319 |
vif_poly = vif(model_selected_poly_interactions_aic)
vif_poly[vif_poly > 5]
## HIV.AIDS I(HIV.AIDS^2) Adult.Mortality:HIV.AIDS
## 13.115 9.183 6.023
length(coef(model_selected_poly_interactions_aic))
## [1] 11
model_selected_outliers = lm(
Life.expectancy ~ Adult.Mortality +
infant.deaths + Polio + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources +
Schooling + Adult.Mortality:HIV.AIDS + I(HIV.AIDS ^ 2),
data = data.train,
subset = abs(rstandard(model_selected_poly_interactions_aic)) < 2
)
par(mfrow = c(2, 2))
plot(model_selected_outliers)
show_metrics(model_selected_outliers)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.98371246826394e-22 | 0.0785951289988709 | 0.902757127129399 | 2.69776680411486 | 11.6767587462108 | 14.2696356612668 |
Based on above plots, we can say that we have achieved a good model based with 10 predictors and the formula is:
formula = Life.expectancy ~ Adult.Mortality + infant.deaths + Polio + HIV.AIDS + thinness.5.9.years + Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS + I(HIV.AIDS^2)
Also, looking at the model it seems, we are satisfying assumptions of linear model.
Evaluation metric are as follows:
show_metrics(model_selected_outliers)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.98371246826394e-22 | 0.0785951289988709 | 0.902757127129399 | 2.69776680411486 | 11.6767587462108 | 14.2696356612668 |
Box-Cox confirms that it is not necessary to transform y variable.
# Box-Cox Transformation
par(mfrow = c(1, 2))
boxcox(model_selected_outliers, plotit = TRUE)
boxcox(model_selected_outliers,
plotit = TRUE,
lambda = seq(0.5, 1.5, by = 0.1))
Explore the relationship between Life.expectancy and each predictor.
plot(Life.expectancy ~ Adult.Mortality, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs Adult.Mortality")
plot(Life.expectancy ~ infant.deaths, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs infant.deaths")
plot(Life.expectancy ~ Polio, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs Polio")
plot(Life.expectancy ~ HIV.AIDS, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs HIV.AIDS")
plot(Life.expectancy ~ thinness.5.9.years, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs thinness.5.9.years")
plot(Life.expectancy ~ Income.composition.of.resources, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs Income.composition.of.resources")
plot(Life.expectancy ~ Schooling, data = data.train, col = "grey", pch = 20, cex = 1.5,
main = "Life.expectancy vs Schooling")
lm_le_am = lm(Life.expectancy ~ Adult.Mortality, data = data.train)
show_metrics(lm_le_am)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.93888870084312e-18 | 5.44652981244873e-27 | 0.484496184454426 | 6.32984080904256 | 39.9298914485206 | 37.4349744962051 |
lm_le_am_log = lm(Life.expectancy ~ log(Adult.Mortality), data = data.train)
show_metrics(lm_le_am_log)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.0000109121008769902 | 4.88198600292786e-17 | 0.183382442308834 | 7.97190730155398 | 63.2535578792075 | 61.5102146650863 |
# bp.test result is significantly improved with log transformation of Adult.Mortality.
lm_le_am_poly = lm(Life.expectancy ~ Adult.Mortality + I(Adult.Mortality ^ 2), data = data.train)
show_metrics(lm_le_am_poly)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 3.17932361175982e-25 | 2.76151917851856e-24 | 0.499632136383793 | 6.24283249651564 | 38.7238473195189 | 36.0427661597144 |
# Polynomial transformation doesn't work.
lm_le_id = lm(Life.expectancy ~ infant.deaths, data = data.train)
show_metrics(lm_le_id)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.02657747833814 | 8.27589915106478e-17 | 0.0284300058581662 | 8.69566904890079 | 75.2558627711913 | 74.8725034661356 |
# lm_le_id_log = lm(Life.expectancy ~ log(infant.deaths), data = data.train)
# couldn't perform log transformation because of O values.
lm_le_id_poly = lm(Life.expectancy ~ infant.deaths + I(infant.deaths ^ 2), data = data.train)
show_metrics(lm_le_id_poly)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.000298263106736209 | 2.51398765428739e-15 | 0.0893880358135724 | 8.42415869521536 | 70.4729484696273 | 71.4095417460136 |
# Polynomial transformation of infant.deaths doesn't improve.
lm_le_po = lm(Life.expectancy ~ Polio, data = data.train)
show_metrics(lm_le_po)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 2.66387530899153e-06 | 1.15592951036877e-13 | 0.119174477527461 | 8.27688462546276 | 68.2269780296235 | 71.0842938162934 |
lm_le_po_log = lm(Life.expectancy ~ log(Polio), data = data.train)
show_metrics(lm_le_po_log)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.0318322360510644 | 9.61459294344335e-16 | 0.0571717899249146 | 8.56104117786634 | 73.0295818335663 | 74.7811670558807 |
# bp.test result is improved with log transformation of Polio.
lm_le_po_poly = lm(Life.expectancy ~ Polio + I(Polio ^ 2), data = data.train)
show_metrics(lm_le_po_poly)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.0200803369616035 | 6.03245675563947e-12 | 0.235744603268793 | 7.71173926979946 | 59.1463030464273 | 60.9914024560467 |
# Polynomial transformation also helps.
lm_le_hiv = lm(Life.expectancy ~ HIV.AIDS, data = data.train)
show_metrics(lm_le_hiv)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.33841446544784e-10 | 7.77718772625768e-07 | 0.362320502385172 | 7.04958025653526 | 49.3933746964793 | 52.1499460150461 |
lm_le_hiv_log = lm(Life.expectancy ~ log(HIV.AIDS), data = data.train)
show_metrics(lm_le_hiv_log)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.00398413313373982 | 5.03204393449706e-06 | 0.644648154468367 | 5.25460096055283 | 27.5248411170202 | 29.573112720294 |
# bp.test result is significantly improved with log transformation of HIV.AIDS.
lm_le_hiv_poly = lm(Life.expectancy ~ HIV.AIDS + I(HIV.AIDS ^ 2), data = data.train)
show_metrics(lm_le_hiv_poly)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.0468046150825987 | 1.55206717595535e-06 | 0.487029498855961 | 6.33118333697985 | 39.6991749673099 | 43.4723916195385 |
# Polynomial transformation also helps.
lm_le_hiv_log_poly = lm(Life.expectancy ~ log(HIV.AIDS) + I(HIV.AIDS ^ 2), data = data.train) # doesn't further improve
show_metrics(lm_le_hiv_log_poly)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.00686870383790006 | 5.87198474501198e-06 | 0.644741393902713 | 5.25324248579441 | 27.4937321554455 | 29.5878606382631 |
lm_le_thn = lm(Life.expectancy ~ thinness.5.9.years, data = data.train)
show_metrics(lm_le_thn)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.6062500515926e-20 | 1.91144004921282e-15 | 0.21336005686777 | 7.82538800946802 | 60.9315519907407 | 61.6658609753975 |
lm_le_thn_log = lm(Life.expectancy ~ log(thinness.5.9.years), data = data.train)
show_metrics(lm_le_thn_log)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 0.289116205119675 | 4.52200164957121e-21 | 0.316302351683306 | 7.29420952049337 | 52.9578483371689 | 57.4977265421249 |
# bp.test result is significantly improved with log transformation.
lm_le_thn_poly = lm(Life.expectancy ~ thinness.5.9.years + I(thinness.5.9.years ^ 2),
data = data.train)
show_metrics(lm_le_thn_poly)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 9.56992897947193e-12 | 1.33708429290685e-12 | 0.362390018182186 | 7.04917200577612 | 49.3451186230709 | 51.7643974324823 |
# Polynomial transformation doesn't help a lot.
lm_le_icr = lm(Life.expectancy ~ Income.composition.of.resources, data = data.train)
show_metrics(lm_le_icr)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 5.39868610150089e-72 | 9.78699333659299e-26 | 0.559087343704458 | 5.87166945958373 | 34.1522098833106 | 45.020683088903 |
# lm_le_icr_log = lm(Life.expectancy ~ log(Income.composition.of.resources), data = data.train) couldn't perform log tranformation due to 0 values.
# bp.test result is significantly improved with log transformation.
lm_le_icr_poly = lm(
Life.expectancy ~ Income.composition.of.resources + I(Income.composition.of.resources ^ 2),
data = data.train
)
show_metrics(lm_le_icr_poly)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.37401259204068e-21 | 1.9291859350126e-13 | 0.69484854321261 | 4.87547569854632 | 23.6159019817213 | 24.7461223228291 |
# Polynomial transformation helps a little.
lm_le_sch = lm(Life.expectancy ~ Schooling, data = data.train)
show_metrics(lm_le_sch)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 2.0127149476637e-09 | 2.6063913651081e-19 | 0.532073853020822 | 6.03052121613411 | 36.2446206824462 | 36.7431166111802 |
lm_le_sch_log = lm(Life.expectancy ~ log(Schooling), data = data.train)
show_metrics(lm_le_sch_log)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 5.3272402214625e-15 | 9.94327190004781e-20 | 0.498226490690044 | 6.24724276659015 | 38.8663694705834 | 39.1081458945406 |
# log transformation doesn't work.
lm_le_sch_poly = lm(Life.expectancy ~ Schooling + I(Schooling ^ 2), data = data.train)
show_metrics(lm_le_sch_poly)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 4.23704752476234e-08 | 2.84765808491887e-19 | 0.531961805515695 | 6.03435983226363 | 36.2218297792506 | 36.7111420288311 |
# Polynomial transformation doesn't help very much.
# New model based on the transformation result
model_trans = lm(Life.expectancy ~ log(Adult.Mortality) + infant.deaths + Polio + I(Polio ^ 2) + HIV.AIDS + I(HIV.AIDS^2) + log(thinness.5.9.years) + Income.composition.of.resources + Schooling + Adult.Mortality:HIV.AIDS, data = data.train)
show_metrics(model_selected_outliers)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 1.98371246826394e-22 | 0.0785951289988709 | 0.902757127129399 | 2.69776680411486 | 11.6767587462108 | 14.2696356612668 |
show_metrics(model_trans)
| bptest | shapiro_test | adj_r2 | LOOCV | TRAIN_RMSE | TEST_RMSE |
|---|---|---|---|---|---|
| 8.40002629693026e-11 | 3.43722742650363e-07 | 0.836954948860762 | 3.58210034060524 | 12.530477415429 | 15.8469601746367 |
# After transformation, it shows that bptest and shapiro test results improves a little.
MODEL COMPARISON + BEST MODEL + TRAIN vs TEST STATS
Ideas for furhter improvemnts
Put all not important stuff here